Tag

#coding evaluation

2 articles

Separating signal from noise in coding evaluations

OpenAI's analysis reveals significant methodological flaws in SWE-Bench Pro, a popular coding benchmark, raising concerns about the reliability of AI model evaluations.

Jul 8

OpenAI wants to retire the AI coding benchmark that everyone has been competing on

OpenAI plans to retire the SWE-bench Verified benchmark, citing flaws that undermine its validity as a coding performance measure. The move highlights concerns about memorization in AI model evaluations.

Feb 23